DOCUMENT RESUME 



ED 079 385 



TM 002 985 



AUTHOR: Wesman, Alexander G.
TITLE: Comparability Vs. Equivalence of Test Scores.
INSTITUTION: Psychological Corp., New York, N.Y.
REPORT NO: Bull-53
PUB DATE: Sep 58
NOTE: 4p.; Reprint from Test Service Bulletin
JOURNAL CIT: Test Service Bulletin; n53 p6-9 Sep 1958
EDRS PRICE: MF-$0.65 HC-$3.29
DESCRIPTORS: Bulletins; Norms; *Scores; Standardized Tests; *Test Interpretation; *Test Results

ABSTRACT



Comparable scores represent equal rank in a given population; they imply nothing concerning what is being measured. Equivalent scores represent similarity of what is being measured; the more complete the equivalence, the greater the likeness of measurement. The test user would do well to keep these distinctions in mind; to avoid being confused into accepting comparable scores as denoting equivalence of measurement; to insist on close approximations to complete equivalence in alternate forms; and to recognize that in substituting one test for another he may prefer rough equivalence to more precise equivalence if he is seeking to improve validity. (Author)




Test Service Bulletin

No. 53

THE PSYCHOLOGICAL CORPORATION

September, 1958



Published from time to time in the interest of promoting greater understanding of the principles and techniques of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for announcing new publications of interest. Address communications to 304 East 45th Street, New York 17, N. Y.



Harold G. Seashore, Editor
Director of the Test Division

Alexander G. Wesman
Associate Director of the Test Division

Jerome E. Doppelt
Assistant Director

James H. Ricks, Jr.
Assistant Director

Dorothy M. Clendenen
Assistant Director

Esther H. Hollis
Advisory Service



COMPARABILITY VS. EQUIVALENCE OF TEST SCORES






The vagaries of the English language must be a source of considerable bewilderment to those who are faced suddenly with the need to learn our tongue. How does one learn what "fix" means? The mariner who wishes to determine his position gets a fix; the professional crook seeks a game that he can fix; the squeaky door needs to be fixed; a committee chairman fixes a date; a student eyes his professor with a fixed stare; a culprit is found and blame is fixed, and the culprit finds himself in a fix. So, too, the special language of tests and measurements contains some ambiguities. But lest they interfere with clear understanding of important concepts, ambiguities ought to be clarified. Two of the common words in the testing field which are surrounded by confusion are "comparable" and "equivalent."

Two test scores are equivalent if either can properly be substituted for the other. Both the trait being measured and the means of measurement must correspond. Two scores may be comparable, on the other hand, yet reflect very dissimilar abilities. In fact, the scores may be numerically quite different (may even be expressed in different units) and still be comparable.



Comparability properly refers merely to rank in a group; the term carries no connotation with respect to what is being measured. For example, within the Differential Aptitude Tests a score of 56 on the Clerical Speed and Accuracy test is comparable for certain individuals to a score of 47 on the Mechanical Reasoning test. In each case, the score represents the 70th percentile for tenth grade boys in the population used in standardizing the tests. The central fact to be noted is that two scores are comparable if they represent the same standing in the same population. There is no implication that the scores denote the same, or even similar, abilities. Even casual inspection of the two tests reveals how little they measure in common. In fact, the average correlation between the Mechanical and Clerical tests is about .10.
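The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two score lists are hypothetical (they are not DAT norms), and the helper names percentile_rank and are_comparable are invented for the example. Percentile rank is taken, as in the text, to be the per cent of the norm group scoring below a given score.

```python
from bisect import bisect_left

def percentile_rank(score, norm_scores):
    """Per cent of the norm group scoring below `score`."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

def are_comparable(score_a, norms_a, score_b, norms_b):
    """Two scores are comparable if they hold the same rank in the SAME group."""
    return percentile_rank(score_a, norms_a) == percentile_rank(score_b, norms_b)

# Hypothetical raw scores for ONE norm group on two unrelated tests.
clerical = [30, 35, 41, 44, 48, 50, 52, 56, 58, 60]
mechanical = [20, 25, 28, 33, 36, 40, 42, 47, 49, 52]

print(percentile_rank(56, clerical))    # rank of 56 on the clerical test
print(percentile_rank(47, mechanical))  # rank of 47 on the mechanical test
print(are_comparable(56, clerical, 47, mechanical))
```

Nothing in the calculation looks at test content or at the correlation between the tests; only the ranks within the one group enter into it.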

If a low coefficient of correlation between two sets of test scores doesn't preclude comparability, neither does a high one assure it. As indicated above, the size of the correlation coefficient is irrelevant to the matter of comparability. Scores on the DAT Numerical Ability test and the Stanford Arithmetic test are not comparable even though these two tests may be expected to correlate about .75. Scores on these tests are not comparable because the tests were not standardized on the same population. For similar reasons, scores on the DAT Space Relations test are not comparable to scores on the Revised Minnesota Paper Form Board, although both are tests of space perception. It is not what the tests measure but the population used in standardizing the tests that determines comparability.

But, we may ask, if comparability is merely a matter of giving two tests to one population, cannot one make any two tests comparable by giving them to a single group? Yes, indeed. Any school or business organization can develop sets of comparable scores by giving any two (or more) tests to its students or employees.



The contents of this Bulletin are not copyrighted; the articles may be quoted or reprinted without formality other than the customary acknowledgment of the Test Service Bulletin of The Psychological Corporation as the source.






Will such data then be useful to other institutions? That depends on the resemblance between the group on which comparability of scores was based, and the group with which the result is to be used. If the groups are sufficiently alike, a table of comparable scores will apply about as well to the second as it does to the first group. If, on the other hand, the two groups are unlike in some important respect (e.g., age, sex, education, relevant environment, etc.), it may be inadvisable to assume that the table of comparable scores will apply as well to the second group. For example, among tenth grade boys a score of 44 on the DAT Mechanical Reasoning test is comparable to a score of 34 on the DAT Sentences test; both are at the sixtieth percentile for this norms group. Among tenth grade girls, the same Mechanical Reasoning score of 44 is comparable to a Sentences score of 66; both are at the ninety-fifth percentile for girls in the tenth grade. Like norms and validity, comparability is specific to the group on which the data are obtained.

It may seem surprising that two scores which represent equal standing in one group may reflect quite different standings in another group. Some thoughtful consideration, however, will make it evident that such variations in comparability should be expected. An example may help to illuminate the issue. Let us suppose that a test of English grammar and a test of reading comprehension in French have been administered to two groups of students. Group A consists of freshmen who have had only three months of exposure to the learning of French; Group B consists of sophomores who have just completed two years of course work in the subject. We now prepare distributions of scores for the pair of tests and then compute percentiles to show what per cent of students fall below each score on each test. We compute these percentiles separately for the freshmen and sophomores. For the freshmen, we find the score at the 50th percentile on the English grammar test, and the score at the 50th percentile on the French reading comprehension test. These two scores are comparable, for the freshmen. What happens when we seek similarly comparable scores for the sophomores? On the English grammar test the score which is at the 50th percentile for sophomores is likely to be a little higher than the median for freshmen. The French comprehension score at the 50th percentile for sophomores is likely to be very much higher than the score at the 50th percentile for freshmen. The increased knowledge of French represented by the additional year and two-thirds of study will have a far greater effect on the French test scores than an additional year of exposure to English. We may expect, then, that the French score comparable to a particular score in English will be appreciably higher for sophomores than for freshmen.



Table I has been prepared to illustrate the situation. Inspection of the table shows that, for freshmen who have studied French for three months, an English grammar score of 58 is comparable to a French comprehension score of 64. For sophomores who have finished two years of French, however, an English grammar score of 58 is comparable to a French comprehension score of 73, a substantial difference. Clearly, any attempt to apply these freshman data on comparability to the sophomores would result in serious error. Proper interpretation of comparable scores requires that we know the characteristics of the group on which comparability was established. If we wish to apply published tables of comparable scores to our local population, we need to assure ourselves that the groups are sufficiently similar to permit such generalization.
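The lookup just described can be written out as a short Python sketch. The score columns are those of Table I; the function name comparable_french and the dictionary layout are assumptions made for the example only. For a given English grammar score, the function returns the percentile and the French reading score at that same percentile within one group.

```python
# Percentile rows of Table I, highest to lowest.
PERCENTILES = [99, 97, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50,
               45, 40, 35, 30, 25, 20, 15, 10, 5, 3, 1]

# Standard scores at each percentile, taken from Table I.
FRESHMEN = {
    "english": [82, 78, 75, 72, 70, 68, 67, 65, 64, 63, 62, 61,
                59, 58, 57, 56, 55, 53, 51, 49, 46, 44, 40],
    "french":  [88, 85, 82, 79, 77, 75, 73, 72, 70, 69, 68, 67,
                66, 64, 63, 62, 60, 59, 57, 54, 51, 48, 44],
}
SOPHOMORES = {
    "english": [82, 79, 76, 74, 72, 70, 68, 67, 66, 64, 63, 62,
                61, 60, 58, 57, 56, 54, 52, 50, 47, 45, 42],
    "french":  [98, 95, 92, 89, 86, 84, 83, 81, 80, 79, 78, 76,
                75, 74, 73, 72, 70, 68, 66, 64, 61, 58, 54],
}

def comparable_french(english_score, norms):
    """Return (percentile, French score) comparable to an English score
    within one norm group, or None if the score is not tabled."""
    if english_score not in norms["english"]:
        return None
    row = norms["english"].index(english_score)
    return PERCENTILES[row], norms["french"][row]

print(comparable_french(58, FRESHMEN))    # (40, 64)
print(comparable_french(58, SOPHOMORES))  # (35, 73)
```

The same English score leads to two different French scores because the two groups supply two different sets of norms; applying the freshman table to sophomores would return 64 where 73 is wanted.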

TABLE I. Illustrative Norms for Two Groups

                    Freshmen              Sophomores
              English    French       English    French
Percentile    Grammar    Reading      Grammar    Reading

    99           82         88           82         98
    97           78         85           79         95
    95           75         82           76         92
    90           72         79           74         89
    85           70         77           72         86
    80           68         75           70         84
    75           67         73           68         83
    70           65         72           67         81
    65           64         70           66         80
    60           63         69           64         79
    55           62         68           63         78
    50           61         67           62         76
    45           59         66           61         75
    40           58         64           60         74
    35           57         63           58         73
    30           56         62           57         72
    25           55         60           56         70
    20           53         59           54         68
    15           51         57           52         66
    10           49         54           50         64
     5           46         51           47         61
     3           44         48           45         58
     1           40         44           42         54

This table is but slightly adapted from tables of norms found in the published manuals for a test of English grammar and a test of French reading comprehension. The scores are standard scores based on a single scale, with a standard deviation of approximately 10.

Perhaps the most important distinction between "comparability" and "equivalence" is that, whereas test content is irrelevant to comparability, test content is fundamental to equivalence. Two test scores are equivalent if they can properly be substituted for one another. Essentially, this means that scores from one test must represent the same psychological or educational qualities in the individual as do scores from the other test. Most precisely, two tests are completely equivalent if their content is essentially identical and they measure with equal precision (reliability). If these conditions are met, it does not matter which of the two tests is used. These conditions are ordinarily most closely approximated where parallel forms of a test have been constructed: forms which are intended to be interchangeable.

When parallel forms of a test are available, there is ordinarily the implicit, if not explicit, assumption that these forms are actually interchangeable. This means that we have no basis for suggesting that a person take one form rather than another; the information obtained will be of equal value whether Form A or Form B is administered. The specific items in one form are of no greater significance than the items which happen to be in the alternate form.

Assumptions we can make with regard to content and reliability of parallel forms of one test are not readily acceptable when we are dealing with two somewhat different tests of the same general ability. This situation is one in which the problem of equivalence frequently arises. For example, a counselor may have reading comprehension scores from the Stanford Achievement Test for some pupils, and scores from the Iowa Silent Reading Test for other pupils; or, an industrial organization may wish to substitute a modern clerical aptitude test for an outmoded one. In such cases, it is important to know the degree of equivalence of the scores from the two reading tests, or the two clerical tests.

In these circumstances, the size of the coefficient of correlation between the tests is of prime importance. Obviously, lack of perfect reliability in each of the tests will prevent the correlation coefficient from reaching 1.00. Even disregarding the effects of unreliability, however, the correlation would still be less than perfect because each reading test was constructed somewhat differently from the other; the two clerical tests were also prepared according to distinctly different plans. The greater the divergence in specific abilities measured, the more ambiguous the term "equivalent" becomes.

If the correlation coefficient is 1.00, we can say with complete confidence that all persons who score in, say, the sixth decile (51st to 60th percentiles) on one test will also score in the sixth decile on the other. If the coefficient is .90, we may expect that, of those who score in the sixth decile on one test, 22.5% will score in the sixth decile on the other test; the remaining 77.5% will be distributed as follows: approximately 20% each in the fifth and seventh deciles, about 13% each in the fourth and eighth deciles, and the remainder in the second, third, ninth, and tenth deciles. If the coefficient is .75, of those who score in the sixth decile on the first test, we may expect 15.2% to score in the sixth decile on the second test. The other examinees would be found in the first decile (1.9%), the second decile (6.0%), the third (9.6%), the fourth (12.4%), the fifth (14.2%), the seventh (14.6%), the eighth (12.9%), the ninth (9.4%), and the tenth decile (3.8%). In these circumstances, we cannot say that an individual will certainly achieve the same score on one test as he does on another. Instead, we can speak only of the probability that people who make a certain score on one test will obtain various scores on the other.*
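Percentages of this kind can be approximated numerically. The sketch below assumes that the two sets of scores follow a bivariate normal distribution with correlation r; that assumption is not stated in the text and is introduced here only to make the calculation possible. The function name decile_row and the steps parameter are likewise hypothetical. For persons in a chosen decile on the first test, the function returns the expected proportion falling in each decile on the second test, and under this assumption it should give values close to the per cents quoted above.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def decile_row(r, decile=6, steps=200):
    """Proportion expected in each decile (1..10) of test 2 among persons
    in `decile` of test 1, assuming bivariate normal scores with
    correlation r (0 <= r < 1)."""
    # Decile boundaries in standard-score units; +/-10 stands in for infinity.
    cuts = [-10.0] + [STD.inv_cdf(k / 10) for k in range(1, 10)] + [10.0]
    lo, hi = cuts[decile - 1], cuts[decile]
    s = sqrt(1.0 - r * r)          # spread of test 2 for a fixed test-1 score
    width = (hi - lo) / steps
    row = [0.0] * 10
    for i in range(steps):         # midpoint rule over the test-1 decile
        x = lo + (i + 0.5) * width
        weight = STD.pdf(x) * width
        for j in range(10):
            upper = STD.cdf((cuts[j + 1] - r * x) / s)
            lower = STD.cdf((cuts[j] - r * x) / s)
            row[j] += weight * (upper - lower)
    total = sum(row)
    return [p / total for p in row]

for r in (0.90, 0.75):
    print(r, [round(100 * p, 1) for p in decile_row(r)])
```

As r falls, the row flattens: fewer persons stay in the sixth decile and more are scattered toward the extremes, which is the point of the paragraph above.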

In practice it would be extremely awkward to present a table of equivalents in terms of these probabilities. To simplify matters, we present pairs of individual scores as equivalents, usually based on the equi-percentile method or a variant of it. That is, we find for a given group those scores which are at the 40th percentile on forms A and B, and present those scores as equivalent. What distinguishes the procedure from that in which we obtain comparability of scores is that we have in the equivalence table the assumption that what is being measured is the same in the two forms.
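A minimal sketch of the equi-percentile idea follows. The score lists for Forms A and B are hypothetical, the function names are invented for the example, and the nearest-rank rule used to locate a percentile is one simple variant among several; published tables are usually smoothed in ways not shown here.

```python
def score_at_percentile(scores, pct):
    """Nearest-rank score at the given whole-number percentile."""
    ordered = sorted(scores)
    # Integer ceiling of pct * n / 100, kept within 1..n.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[min(rank, len(ordered)) - 1]

def equipercentile_table(form_a, form_b, percentiles):
    """Pair the Form A and Form B scores found at each percentile."""
    return [(p, score_at_percentile(form_a, p), score_at_percentile(form_b, p))
            for p in percentiles]

# Hypothetical scores of ONE group on two forms of the same test.
form_a = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
form_b = [10, 14, 16, 19, 21, 23, 26, 28, 31, 35]

for pct, a, b in equipercentile_table(form_a, form_b, [40, 50, 90]):
    print(f"percentile {pct}: Form A {a} matched with Form B {b}")
```

The arithmetic is identical to that used for comparable scores. Whether the matched pairs may also be called equivalent depends on the assumption about content, which the code cannot check.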

Does this mean that an older test cannot be replaced by a newer and presumably better test? Not at all. To persist in the use of instruments when more valid or more efficient tests become available is poor practice. The heart of any test use is validity: whether the test is doing what it is intended to do. If test N can offer appreciably better prediction than test Q, test N should replace test Q in the particular situation; in this case we do not want a truly equivalent test; we want a better test. If we have had a good deal of experience with test Q, we may wish to know the relative rank represented by specific scores on tests Q and N. If we have used a cutoff score on test Q, we may wish to know what score on test N would eliminate a similar proportion of the applicants. This information can be obtained by giving both tests to the same population, or to two very similar populations.
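The cutoff problem can be sketched as follows. The applicant scores are hypothetical, and the function name matching_cutoff and the rule "reject if the score is below the cutoff" are assumptions made for the illustration. Both tests are taken to have been given to the same ten applicants, listed in the same order.

```python
def matching_cutoff(scores_q, cutoff_q, scores_n):
    """Cutoff on test N that rejects as many applicants as cutoff_q
    rejects on test Q (an applicant is rejected if score < cutoff).
    With tied scores on N, slightly fewer may be rejected."""
    rejected = sum(1 for s in scores_q if s < cutoff_q)
    ordered_n = sorted(scores_n)
    if rejected >= len(ordered_n):
        return ordered_n[-1] + 1   # cutoff above every score: all rejected
    return ordered_n[rejected]

# Hypothetical scores of the same ten applicants on both tests.
scores_q = [48, 52, 55, 57, 60, 63, 66, 70, 74, 79]
scores_n = [31, 40, 36, 44, 38, 47, 50, 45, 55, 58]

cutoff_q = 60
cutoff_n = matching_cutoff(scores_q, cutoff_q, scores_n)

accepted_q = {i for i, s in enumerate(scores_q) if s >= cutoff_q}
accepted_n = {i for i, s in enumerate(scores_n) if s >= cutoff_n}

print(cutoff_n)                          # cutoff on test N
print(len(accepted_q), len(accepted_n))  # numbers accepted by each test
print(accepted_q ^ accepted_n)           # applicants treated differently
```

In this made-up sample the two cutoffs pass the same number of applicants but not the same persons, since the two tests do not rank the applicants identically.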

The resulting table of matched scores is a table of comparability. To evaluate the degree to which the table is also a table of equivalents, we need to know the coefficient of correlation between the two sets of scores. If the scores are comparable and we use the same cutoff score on test N that we used for test Q, we will accept the same number of applicants. Because the tests are not perfectly reliable nor precisely equivalent, we will not accept precisely the same individuals by means of the two tests; and because test N is more valid, we will accept a larger number of good applicants and a smaller number of prospective failures. This is an outcome much to be desired. We are obviously not seeking precise equivalence. We are happy to trade some precision in the equivalence for some improvement in validity.

To summarize, comparable scores represent equal rank in a given population; they imply nothing concerning what is being measured. Equivalent scores represent similarity of what is being measured; the more complete the equivalence, the greater the likeness of measurement. The test user would do well to keep these distinctions in mind; to avoid being confused into accepting comparable scores as denoting equivalence of measurement; to insist on close approximations to complete equivalence in alternate forms; and to recognize that in substituting one test for another he may prefer rough equivalence to more precise equivalence if he is seeking to improve validity.

A.G.W.

*The above statements apply to alternate forms of tests as well as to tests intended to measure somewhat different abilities. If alternate forms of a test correlate .75, the per cents to be expected in each decile will be the same as for a coefficient of .75 between non-parallel tests.



A New Reading Test for Use in High Schools and Colleges

THE DAVIS READING TEST

Frederick B. Davis and Charlotte Croon Davis

Carefully constructed to measure the reading skills of college freshmen and high school juniors and seniors, this new reading test provides scores in:

1. Level of comprehension

2. Speed of comprehension

The Level score indicates the depth of understanding displayed by a student in reading the kinds of material he is ordinarily required to read in high school and college; the Speed score indicates the rapidity and accuracy with which he understands the same material.

Passages varying in length from five to thirty lines are used as a basis for multiple-choice items measuring five categories of reading skills:

► Finding the answers to questions answered (explicitly or in paraphrase) in a passage;

► Weaving together the ideas in a passage and grasping its central thought;

► Making inferences about the subject of a passage and about its author's purpose or viewpoint;

► Recognizing the tone and mood of a passage and the literary devices used by its author;

► Following the structure of a passage, as in identifying antecedents and referents.

The Davis Reading Test is available in four equivalent forms. In each form of the test, the first and second halves have been carefully equated. Within the 40-minute time limit nearly every examinee completes the first half, and the Level of Comprehension score is based on this portion. The Speed of Comprehension score is based on the whole test.

The test may be scored quickly and easily either by machine or by hand. Raw scores are converted into scaled scores representing the same relative amounts of ability in either Level of Comprehension or Speed of Comprehension, regardless of the form of the test used. Percentile norms are provided for students in the eleventh and twelfth grades and for college freshmen. The standardization is based on over 18,000 students in 18 colleges and 29 high schools in 25 states.

Marked discrepancies between scaled scores for Level and for Speed, or scores which are markedly low for the student's grade, indicate a need for individual diagnosis and remedial reading help. Further information of diagnostic value may be obtained by comparing Davis Reading Test percentiles with results on the College Qualification Tests.

In 1963, Series 2 became available for grades 8-11, to supplement Series 1 at the grade 11 to college level. There are four forms at each level:

Grades 8-11: Forms 2A, 2B, 2C, 2D
Grades 11-13: Forms 1A, 1B, 1C, 1D

For packaging and prices of the test booklets, answer sheets, and accessories, see the Test Catalog.



